Predicting New Construction in Philadelphia

MUSA 508 Final Project

Authors

Laura Frances and Nissim Lebovits

Published

December 13, 2023

1 Summary

This project presents a statistical analysis of urban development in Philadelphia, focusing on the balance between growth, affordability, and gentrification. Central to the study is a random forest model that predicts future development patterns with a low mean absolute error. By forecasting where growth is likely to occur, the model can serve as a tool for urban planners and policymakers: it can be leveraged to promote proactive upzoning of high-priority parcels, aligning current zoning more closely with anticipated development. This approach is aimed in particular at fostering affordable housing in Philadelphia, one of the city’s most pressing challenges. Through this blend of data-driven insights and targeted policy recommendations, the project seeks to guide Philadelphia toward a more equitable and sustainable urban future.

2 Introduction

Show the code
required_packages <- c("tidyverse", "sf", "acs", "tidycensus", "sfdep", "kableExtra", "conflicted",
                       "gganimate", "tmap", "gifski", "transformr", "ggpubr", "randomForest", "janitor",
                       'igraph', "plotly", "ggcorrplot", "Kendall")
suppressWarnings(
install_and_load_packages(required_packages)
)

source("utils/viz_utils.R")

Philadelphia, the sixth-largest city in the United States, presents a unique case study in urban development. Despite ranking only 42nd in cost of living, the city’s approach to housing supply and affordability stands out. In the complex landscape of urban growth, Philadelphia treads a precarious line, balancing the benefits of new construction against the challenges of affordability and gentrification. This analysis examines those dynamics: how the city’s growth, often driven more by politics than by strategic or ‘smart’ planning, affects its residents and neighborhoods.

The citywide impact of development in Philadelphia suggests a positive trend towards increased affordability and choice for its inhabitants. However, this macro view masks the localized burdens of development, where the cost and impact of new construction can be disproportionately felt. The central question then arises: How can Philadelphia grow equitably? The answer may lie in the concept of ‘smart growth’, a strategic approach to urban planning. But in reality, growth in Philadelphia is more influenced by political forces than by comprehensive, well-thought-out plans. Comprehensive plans, ideally set on ten-year timelines, often become mere suggestions, subject to the whims of city council members rather than serving as steadfast guides for development.

The statistical analysis will further explore the critical issue of zoning mismatch, a barrier to ensuring dense and equitable growth in Philadelphia. The current system, where permitting is largely controlled by individual council members (‘council-member-time’), often leads to a situation where exceptions to zoning rules unfairly become the norm. This analysis aims to provide dynamic data and insights to empower zoning advocates, offering them the tools to effectively counter these forces. By highlighting the need for intervention and the reduction of zoning mismatches, the study seeks to contribute to a broader understanding of how Philadelphia can navigate its growth challenges and opportunities, moving towards a more equitable and sustainable urban future.

  1. Predict development pressure: how do we define “a lot of development”?

  2. Define affordability burden: how do we define “affordability burden”? Candidates include the year-over-year change in the share of the population experiencing rent burden (which will probably show extreme tipping points), population growth, and the percent change in area incomes

  3. Identify problem zoning

  4. Calculate number of connected parcels

  5. Predict development pressure at the block level

  6. Identify not burdened areas

  7. Identify problem zoning

  8. Calculate number of connected parcels

  9. Advocate for upzoning in parcels where there is local development pressure, no affordability burden, problem zoning, and high number of connected parcels
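The final step above can be sketched as a simple filter over the parcel data. The column names below (`dev_pressure`, `rent_burdened`, `mismatch`, `n_contig`) and thresholds are hypothetical placeholders for the features built later in the analysis:

```r
# Hypothetical sketch of step 9: filter to upzoning candidates.
# All column names and thresholds are illustrative placeholders,
# not the actual fields constructed later in the analysis.
parcels <- data.frame(
  parcel_id     = 1:4,
  dev_pressure  = c(25, 3, 18, 30),            # predicted local permit counts
  rent_burdened = c(FALSE, TRUE, FALSE, TRUE), # local affordability burden
  mismatch      = c(TRUE, TRUE, TRUE, FALSE),  # zoned below likely use
  n_contig      = c(4, 5, 1, 6)                # connected adjacent parcels
)

upzone_candidates <- subset(
  parcels,
  dev_pressure > 10 & !rent_burdened & mismatch & n_contig > 2
)

upzone_candidates$parcel_id  # only parcel 1 meets all four criteria
```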

Key spatial features include:

  • transit
  • zoning (OSM)
  • land use (OSM)
Show the code
urls <- c(
  roads = 'https://opendata.arcgis.com/datasets/261eeb49dfd44ccb8a4b6a0af830fdc8_0.geojson', # for broad and market
  council_dists = "https://opendata.arcgis.com/datasets/9298c2f3fa3241fbb176ff1e84d33360_0.geojson",
  building_permits = building_permits_path,
  permits_bg = final_dataset_path,
  zoning = "https://opendata.arcgis.com/datasets/0bdb0b5f13774c03abf8dc2f1aa01693_0.geojson",
  opa = "data/opa_properties.geojson",
  ols_preds = 'data/model_outputs/ols_preds.geojson',
  rf_test_preds = 'data/model_outputs/rf_test_preds.geojson',
  rf_val_preds = 'data/model_outputs/rf_val_preds.geojson',
  rf_proj_preds = 'data/model_outputs/rf_proj_preds.geojson'
  
)

suppressMessages({
  invisible(
    imap(urls, ~ assign(.y, phl_spat_read(.x), envir = .GlobalEnv))
  )
})

broad_and_market <- roads %>% filter(ST_NAME %in% c('BROAD',"MARKET") | SEG_ID %in% c(440370, 421347,421338,421337,422413,423051,440403,440402,440391,440380))

council_dists <- council_dists %>%
                    select(DISTRICT)

building_permits <- building_permits %>%
                      filter(permittype %in% c("RESIDENTIAL BUILDING", "BP_ADDITION", "BP_NEWCNST"))
Show the code
tm_shape(permits_bg %>% filter(year == 2023)) +
        tm_polygons(col = palette[4], alpha = 0.5, colorNA = "lightgrey") +
  tm_shape(broad_and_market) +
  tm_lines(col = "darkgrey") +
  tm_layout(frame = FALSE, title = "Philadelphia Block Groups") 

Show the code
# tm_out <- tm_shape(acs22) +
#         tm_polygons(col = "Ext_Rent_Burden", border.alpha = 0, palette = mono_5_orange, style = "fisher", colorNA = "lightgrey", title = "Extreme Rent Burden") +
#   tm_shape(broad_and_market) +
#   tm_lines(col = "darkgrey") +
#   tm_layout(frame = FALSE, title = "Extreme Rent Burden\nPhiladelphia, 2022") 
# 
# tmap_save(tm_out, "assets/extreme_rent_burden_2022.jpg", dpi = 500)

tm <- tm_shape(permits_bg %>% filter(!year %in% c(2012, 2024))) +
        tm_polygons(col = "permits_count", border.alpha = 0, palette = mono_5_green, style = "fisher", colorNA = "lightgrey") +
  tm_facets(along = "year") +
  tm_shape(broad_and_market) +
  tm_lines(col = "darkgrey") +
  tm_layout(frame = FALSE) 

suppressMessages(
tmap_animation(tm, "assets/permits_animation.gif", delay = 50)
)

Philadelphia Building Permits, 2013 - 2023

3 Methods

3.1 Data

Show the code
ggplot(building_permits %>% filter(!year %in% c(2024)), aes(x = as.factor(year))) +
  geom_bar(fill = palette[1], color = NA, alpha = 0.7) +
  labs(title = "Permits per Year") +
  theme_minimal()

Show the code
ggplot(permits_bg %>% st_drop_geometry %>% filter(!year %in% c(2024)), aes(x = permits_count)) +
  geom_histogram(fill = palette[1], color = NA, alpha = 0.7) +
  labs(title = "Permits per Block Group per Year",
       subtitle = "Log-Transformed") +
  scale_x_log10() +
  facet_wrap(~year) +
  theme_minimal()

Show the code
perms_x_dist <- st_join(building_permits, council_dists)

perms_x_dist_sum <- perms_x_dist %>%
                  st_drop_geometry() %>%
                  group_by(DISTRICT, year) %>%
                  summarize(permits_count = n())

perms_x_dist_mean = perms_x_dist_sum %>%
                      group_by(year) %>%
                      summarize(permits_count = mean(permits_count)) %>%
                      mutate(DISTRICT = "Average")

perms_x_dist_sum <- bind_rows(perms_x_dist_sum, perms_x_dist_mean) %>%
                        mutate(color = ifelse(DISTRICT != "Average", 0, 1))

ggplotly(
ggplot(perms_x_dist_sum %>% filter(year > 2013, year < 2024), aes(x = year, y = permits_count, color = as.character(color), group = interaction(DISTRICT, color))) +
  geom_line(lwd = 0.7) +
  labs(title = "Permits per Year by Council District",
       y = "Total Permits") +
  # facet_wrap(~DISTRICT) +
  theme_minimal() +
  theme(axis.title.x = element_blank(),
        legend.position = "none") +
  scale_color_manual(values = c(palette[5], palette[1]))
)
3.1.0.1 Correlation Tests
Show the code
corr_vars <- c("total_pop",
               "med_inc",
               "percent_nonwhite",
               "percent_renters",
               "rent_burden",
               "ext_rent_burden")

corr_dat <- permits_bg %>% select(all_of(corr_vars), permits_count) %>% select(where(is.numeric)) %>% st_drop_geometry() %>% unique() %>% na.omit()

corr <- round(cor(corr_dat), 2)
p.mat <- cor_pmat(corr_dat)

ggcorrplot(corr, p.mat = p.mat, hc.order = TRUE,
    type = "full", insig = "blank", lab = TRUE, colors = c(palette[2], "white", palette[3])) +
  annotate(
  geom = "rect",
  xmin = .5, xmax = 7.5, ymin = 4.5, ymax = 5.5,
  fill = "transparent", color = "red", alpha = 0.5
)

3.1.0.2 Emerging Hotspots

Local Moran’s I for 2023

Show the code
lisa <- permits_bg %>% 
  filter(year == 2023) %>%
  mutate(nb = st_contiguity(geometry),
                         wt = st_weights(nb),
                         permits_lag = st_lag(permits_count, nb, wt),
          moran = local_moran(permits_count, nb, wt)) %>% 
  tidyr::unnest(moran) %>% 
  mutate(pysal = ifelse(p_folded_sim <= 0.1, as.character(pysal), NA),
         hotspot = case_when(
           pysal == "High-High" ~ "Significant",
           TRUE ~ "Not Significant"
         ))

# 
# palette <- c("High-High" = "#B20016", 
#              "Low-Low" = "#1C4769", 
#              "Low-High" = "#24975E", 
#              "High-Low" = "#EACA97")

morans_i <- tm_shape(lisa) +
  tm_polygons(col = "ii", border.alpha = 0, style = "jenks", palette = mono_5_green)

p_value <- tm_shape(lisa) +
  tm_polygons(col = "p_ii", border.alpha = 0, style = "jenks", palette = mono_5_green)

sig_hotspots <- tm_shape(lisa) +
  tm_polygons(col = "hotspot", border.alpha = 0, style = "cat", palette = c(palette[2], palette[3]), textNA = "Not a Hotspot")

tmap_arrange(morans_i, p_value, sig_hotspots, ncol = 3)

Emerging Hotspots

Show the code
# stc <- as_spacetime(permits_bg %>% select(permits_count, geoid10, year) %>% na.omit(),
#                  .loc_col = "geoid10",
#                  .time_col = "year")
# 
# # conduct EHSA
# ehsa <- emerging_hotspot_analysis(
#   x = stc,
#   .var = "permits_count",
#   k = 1,
#   nsim = 5
# )
# 
# count(ehsa, classification)

3.1.1 Feature Engineering

Show the code
permits_bg_long <- permits_bg %>%
                    filter(!year %in% c(2024)) %>%
                    st_drop_geometry() %>%
                    pivot_longer(
                      cols = c(starts_with("lag"), dist_to_2022),
                      names_to = "Variable",
                      values_to = "Value"
                    )


ggscatter(permits_bg_long, x = "permits_count", y = "Value", facet.by = "Variable",
   add = "reg.line",
   add.params = list(color = palette[3], fill = palette[5]),
   conf.int = TRUE
   ) + stat_cor(method = "pearson", p.accuracy = 0.001, r.accuracy = 0.01)

3.2 Models

4 Results

Note that our workflow has three stages: train, test, and validate. The first models below are trained and tested on data through 2022; we then validate on 2023 data and, at the end, project for 2024.
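A minimal sketch of this temporal split, using a toy panel (in the analysis, the `permits_bg` block-group panel plays this role):

```r
# Toy panel illustrating the three-stage temporal split; in the analysis,
# the permits_bg block-group panel plays this role.
panel <- data.frame(
  geoid         = rep(c("a", "b"), each = 4),
  year          = rep(2021:2024, times = 2),
  permits_count = c(2, 5, 7, NA, 1, 3, 4, NA)  # 2024 not yet observed
)

train_test <- subset(panel, year <= 2022)  # model tuning + feature engineering
validation <- subset(panel, year == 2023)  # held-out check of the tuned model
projection <- subset(panel, year == 2024)  # final forward prediction
```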

4.1 OLS Regression

To begin, we run a simple regression incorporating three engineered groups of features: space lag, time lag, and distance to 2022. We include this last variable because of a Philadelphia tax abatement policy that led to a significant increase in residential development in the years immediately before 2022. We will use this as a baseline model to compare to our more complex models.
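A sketch of this baseline on simulated data: `dist_to_2022` matches the engineered feature used elsewhere in this analysis, while `lag_space` and `lag_time` are stand-ins for the actual lag column names.

```r
# Sketch of the baseline OLS on simulated data. `lag_space` and `lag_time`
# are placeholders for the actual spatial- and time-lag feature names.
set.seed(1)
toy <- data.frame(
  lag_space    = rpois(100, 5),    # spatially lagged permit counts
  lag_time     = rpois(100, 4),    # prior-year permit counts
  dist_to_2022 = runif(100, 0, 10) # distance (in years) to 2022
)
toy$permits_count <- round(with(toy, 0.6 * lag_space + 0.8 * lag_time -
                                  0.3 * dist_to_2022 + rnorm(100)))

ols_fit <- lm(permits_count ~ lag_space + lag_time + dist_to_2022, data = toy)
mean(abs(toy$permits_count - predict(ols_fit)))  # in-sample MAE
```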

(We considered a Poisson model but found that it struggled with outliers.)

Show the code
ggplot(ols_preds, aes(x = permits_count, y = ols_preds)) +
  geom_point() +
  labs(title = "Predicted vs. Actual Permits",
       subtitle = "2022") +
  geom_smooth(method = "lm", se = FALSE)

Show the code
ggplot(ols_preds, aes(x = abs_error)) +
  geom_histogram() +
  labs(title = "Distribution of Absolute Error per Block Group",
       subtitle = "OLS, 2022")

Show the code
ols_mae <- paste0("MAE: ", round(mean(ols_preds$abs_error, na.rm = TRUE), 2))

ols_preds_map <- tm_shape(ols_preds) +
        tm_polygons(col = "ols_preds", border.alpha = 0, palette = mono_5_green, style = "fisher", colorNA = "lightgrey") +
  tm_shape(broad_and_market) +
  tm_lines(col = "lightgrey") +
  tm_layout(frame = FALSE) 

ols_error_map <- tm_shape(ols_preds) +
        tm_polygons(col = "abs_error", border.alpha = 0, palette = mono_5_orange, style = "fisher", colorNA = "lightgrey") +
  tm_shape(broad_and_market) +
  tm_lines(col = "lightgrey") +
  tm_layout(frame = FALSE) 

tmap_arrange(ols_preds_map, ols_error_map)

We find that our OLS model has an MAE of only 2.66, which is not bad for such a simple model. Still, it struggles most in the areas where we most need it to succeed, so we will introduce better variables and apply a more complex model to improve our predictions.

4.2 Random Forest Regression: Testing

We train and test on data through 2022; this stage is used for model tuning and feature engineering.
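The random forest stage can be sketched on the same simulated features (the `randomForest` package is loaded at the top of this document); the `ntree` and `mtry` values below are illustrative, not our tuned settings.

```r
library(randomForest)

# Sketch of the random forest fit on simulated features;
# ntree and mtry here are illustrative, not the tuned values.
set.seed(1)
toy <- data.frame(
  lag_space    = rpois(200, 5),
  lag_time     = rpois(200, 4),
  dist_to_2022 = runif(200, 0, 10)
)
toy$permits_count <- with(toy, 0.6 * lag_space + 0.8 * lag_time -
                            0.3 * dist_to_2022 + rnorm(200))

rf_fit <- randomForest(
  permits_count ~ lag_space + lag_time + dist_to_2022,
  data = toy, ntree = 500, mtry = 2
)
mean(abs(toy$permits_count - predict(rf_fit)))  # in-sample MAE
```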

Show the code
test_preds_map <- tm_shape(rf_test_preds) +
        tm_polygons(col = "rf_test_preds", border.alpha = 0, palette = mono_5_green, style = "fisher", colorNA = "lightgrey") +
  tm_shape(broad_and_market) +
  tm_lines(col = "lightgrey") +
  tm_layout(frame = FALSE) 

test_error_map <- tm_shape(rf_test_preds) +
        tm_polygons(col = "abs_error", border.alpha = 0, palette = mono_5_orange, style = "fisher", colorNA = "lightgrey") +
  tm_shape(broad_and_market) +
  tm_lines(col = "lightgrey") +
  tm_layout(frame = FALSE) 

tmap_arrange(test_preds_map, test_error_map)

Show the code
ggplot(rf_test_preds, aes(x = abs_error)) +
  geom_histogram() +
  labs(title = "Distribution of Absolute Error per Block Group",
       subtitle = "Random Forest, 2022")

Show the code
ggplot(rf_test_preds, aes(x = permits_count, y = rf_test_preds)) +
  geom_point() +
  labs(title = "Predicted vs. Actual Permits",
       subtitle = "2022") +
  geom_smooth(method = "lm", se = FALSE)

Show the code
rf_test_mae <- paste0("MAE: ", round(mean(rf_test_preds$abs_error, na.rm = TRUE), 2))

4.3 Random Forest Regression: Validation

Having settled on our model features and tuning, we now validate on 2023 data.

Show the code
val_preds_map <- tm_shape(rf_val_preds) +
        tm_polygons(col = "rf_val_preds", border.alpha = 0, palette = mono_5_green, style = "fisher", colorNA = "lightgrey") +
  tm_shape(broad_and_market) +
  tm_lines(col = "lightgrey") +
  tm_layout(frame = FALSE) 

val_error_map <- tm_shape(rf_val_preds) +
        tm_polygons(col = "abs_error", border.alpha = 0, palette = mono_5_orange, style = "fisher", colorNA = "lightgrey") +
  tm_shape(broad_and_market) +
  tm_lines(col = "lightgrey") +
  tm_layout(frame = FALSE) 

tmap_arrange(val_preds_map, val_error_map)

Show the code
ggplot(rf_val_preds, aes(x = abs_error)) +
  geom_histogram() +
  labs(title = "Distribution of Absolute Error per Block Group",
       subtitle = "Random Forest, 2023")

Show the code
ggplot(rf_val_preds, aes(x = permits_count, y = rf_val_preds)) +
  geom_point() +
  labs(title = "Predicted vs. Actual Permits",
       subtitle = "2023") +
  geom_smooth(method = "lm", se = FALSE)

Show the code
rf_val_mae <- paste0("MAE: ", round(mean(rf_val_preds$abs_error, na.rm = TRUE), 2))

5 Discussion

5.1 Accuracy

Predominantly, our model overpredicts. For our purposes, that is preferable to underpredicting, since overprediction errs on the side of flagging areas for new development.

Show the code
nbins <- as.integer(sqrt(nrow(rf_val_preds)))
vline <- mean(rf_val_preds$abs_error, na.rm = TRUE)

ggplot(rf_val_preds, aes(x = abs_error)) +
  geom_histogram(bins = nbins) +
  geom_vline(aes(xintercept = vline))

5.2 Generalizability

Show the code
rf_val_preds <- rf_val_preds %>%
                      mutate(race_comp = case_when(
                        percent_nonwhite >= .50 ~ "Majority Non-White",
                        TRUE ~ "Majority White"
                      ))

ggplot(rf_val_preds, aes(y = abs_error, color = race_comp)) +
  geom_boxplot(fill = NA)

We find that error is not related to affordability, and it actually trends downward with percent nonwhite. (This is probably because less total development is happening in majority-minority neighborhoods to begin with, so the magnitude of error is smaller, even if it is proportionally larger.) Error increases slightly with total population, which makes sense: more people mean more development.

Show the code
ggplot(rf_val_preds, aes(y = abs_error, x = rent_burden)) + # or whatever the variable is
  geom_point() +
  geom_smooth(method = "lm", se= FALSE) +
  theme_minimal()

Show the code
ggplot(rf_val_preds, aes(y = abs_error, x = percent_nonwhite)) + # or whatever the variable is
  geom_point() +
  geom_smooth(method = "lm", se= FALSE) +
  theme_minimal()

Show the code
ggplot(rf_val_preds, aes(y = abs_error, x = total_pop)) + # or whatever the variable is
  geom_point() +
  geom_smooth(method = "lm", se= FALSE) +
  theme_minimal()

Show the code
ggplot(rf_val_preds, aes(y = abs_error, x = med_inc)) + # or whatever the variable is
  geom_point() +
  geom_smooth(method = "lm", se= FALSE) +
  theme_minimal()

How does the model generalize across council districts?

Show the code
suppressMessages(
  ggplot(rf_val_preds, aes(x = reorder(district, abs_error, FUN = mean), y = abs_error)) +
    geom_boxplot(fill = NA) +
    labs(title = "MAE by Council District",
         y = "Mean Absolute Error",
         x = "Council District") +
    theme_minimal() +
    theme()
)

5.3 Assessing Upzoning Needs

We can identify conflicts between projected development and current zoning.

Below, we look at parcels zoned industrial or single-family residential in areas that our model identifies as high development risk for 2023:

Show the code
filtered_zoning <- zoning %>%
                     filter(str_detect(CODE, "RS") | str_detect(CODE, "I"),
                            CODE != "I2",
                            !str_detect(CODE, "SP"))


tm_shape(filtered_zoning) +
        tm_polygons(col = "CODE", border.alpha = 0, colorNA = "lightgrey") +
  tm_shape(broad_and_market) +
  tm_lines(col = "lightgrey") +
  tm_layout(frame = FALSE)

We can extract development predictions at the block level to these parcels and then visualize them by highest need.

Show the code
filtered_zoning <- st_join(
  filtered_zoning,
  rf_val_preds %>% select(rf_val_preds)
)

tm_shape(filtered_zoning) +
        tm_polygons(col = "rf_val_preds", border.alpha = 0, colorNA = "lightgrey", palette = mono_5_orange, style = "fisher") +
  tm_shape(broad_and_market) +
  tm_lines(col = "lightgrey") +
  tm_layout(frame = FALSE)

Show the code
tmap_mode('view')

filtered_zoning %>%
  filter(rf_val_preds > 10) %>%
tm_shape() +
        tm_polygons(col = "CODE", border.alpha = 0, colorNA = "lightgrey",
                    popup.vars = c('rf_val_preds', 'CODE')) +
  tm_shape(broad_and_market) +
  tm_lines(col = "lightgrey") +
  tm_layout(frame = FALSE)

Furthermore, we can identify properties with high potential for assemblage, which suggests the ability to accommodate high-density, multi-unit housing.

Show the code
nbs <- filtered_zoning %>% 
  mutate(nb = st_contiguity(geometry))

# Create edge list while handling cases with no neighbors
edge_list <- tibble::tibble(id = 1:length(nbs$nb), nbs = nbs$nb) %>% 
  tidyr::unnest(nbs) %>% 
  filter(nbs != 0)

# Create a graph with a node for each row in filtered_zoning
g <- make_empty_graph(n = nrow(filtered_zoning))
V(g)$name <- as.character(1:nrow(filtered_zoning))

# Add edges if they exist
if (nrow(edge_list) > 0) {
  edges <- as.matrix(edge_list)
  g <- add_edges(g, c(t(edges)))
}

# Calculate the number of contiguous parcels; neighborhood() at order 1
# includes the node itself, so isolated parcels correctly count as 1
# (this also avoids referencing `edges`, which is undefined when no
# edges exist)
n_contiguous <- sapply(V(g)$name, function(node) {
  length(neighborhood(g, order = 1, nodes = as.numeric(node))[[1]])
})

filtered_zoning <- filtered_zoning %>%
                    mutate(n_contig = n_contiguous)

filtered_zoning %>%
  st_drop_geometry() %>%
  select(rf_val_preds,
         n_contig,
         OBJECTID,
         CODE) %>%
  filter(rf_val_preds > 10,
         n_contig > 2) %>%
  arrange(desc(rf_val_preds)) %>%
  kablerize(caption = "Poorly-Zoned Properties with High Development Risk")
Poorly-Zoned Properties with High Development Risk
rf_val_preds n_contig OBJECTID CODE
868 27.06613 3 1615 ICMX
1548 27.06613 3 2736 IRMX
1587 27.06613 3 2804 IRMX
3420 27.06613 3 6405 RSA5
4667 27.06613 3 9661 RSA5
9169 27.06613 4 20073 ICMX
1768 22.24860 3 3128 IRMX
3640 22.24860 3 6901 ICMX
7517 21.08143 3 16717 RSA5
3934 20.84390 3 7646 ICMX
12326 20.84390 4 25776 RSA5
4957 20.67827 3 10410 ICMX
4958 20.67827 3 10411 RSA5
4959 20.67827 3 10412 ICMX
5245 20.67827 3 11160 RSA5
4460 17.31087 3 9093 RSA5
7726 15.16207 3 17168 ICMX
13578 15.12210 3 27869 IRMX
5088 15.00973 3 10759 IRMX
4512 14.95163 5 9243 IRMX
6014 14.95163 6 13057 ICMX
3041 12.82333 3 5568 ICMX
9842 12.82333 3 21369 RSA5
9843 12.82333 3 21370 ICMX
9845 12.82333 3 21372 RSA5
7833 12.25843 3 17408 RSA5
3957 11.49060 3 7704 IRMX
6645 11.22550 3 14648 ICMX
7280 11.22550 3 16179 RSA5
9912 11.22550 3 21527 ICMX
2138 10.88253 4 3744 IRMX
8143 10.71940 3 18031 RSD3
8656 10.71940 3 19076 RSA3
9409 10.71940 4 20534 RSA2
10175 10.71940 3 22002 RSD1
12605 10.71940 3 26247 RSD1
4146 10.39053 3 8265 IRMX
5108 10.39053 4 10795 IRMX
Show the code
tmap_mode('view')

filtered_zoning %>%
  select(rf_val_preds,
         n_contig,
         OBJECTID,
         CODE) %>%
  filter(rf_val_preds > 10,
         n_contig > 2) %>%
tm_shape() +
        tm_polygons(col = "rf_val_preds", border.alpha = 0, colorNA = "lightgrey", palette = "viridis", style = "fisher",
                    popup.vars = c('rf_val_preds', 'n_contig', 'CODE'), alpha = 0.5) +
  tm_shape(broad_and_market) +
  tm_lines(col = "lightgrey") +
  tm_layout(frame = FALSE)

5.4 Random Forest Regression: Projection

Finally, we can project forward and predict development for 2024.

Show the code
tmap_mode('plot')

tm_shape(rf_proj_preds) +
        tm_polygons(col = "rf_proj_preds", border.alpha = 0, palette = mono_5_green, style = "fisher", colorNA = "lightgrey") +
  tm_shape(broad_and_market) +
  tm_lines(col = "lightgrey") +
  tm_layout(frame = FALSE) 

6 Conclusion

7 Appendices